August 12th, 2015

Acquiring PITCHf/x

library(pitchRx)
# returns a list of related tables (see diagram below)
dat <- scrape(start = "2008-01-01", end = Sys.Date())

Storing PITCHf/x

db <- dplyr::src_sqlite("pitchRx.sqlite3", create = TRUE)
pitchRx::scrape(start = "2008-01-01", end = Sys.Date(), connect = db$con)
  • Any database connection should work!
  • Writes data in streaming chunks to avoid exhausting memory.
  • Keeping your database up-to-date is also easy!
    update_db(db$con)
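
The chunked-write idea can be sketched with plain DBI calls. This is only an illustration of appending one chunk at a time, not pitchRx's actual implementation; `fake_day()` and the in-memory database are hypothetical stand-ins, and the sketch assumes the DBI and RSQLite packages are installed.

```r
library(DBI)

# In-memory SQLite database, used only for this illustration
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical stand-in for one day's worth of scraped pitches
fake_day <- function(day) {
  data.frame(date = format(day, "%Y_%m_%d"),
             px = rnorm(100), pz = rnorm(100, mean = 2.5))
}

days <- seq(as.Date("2013-04-01"), as.Date("2013-04-10"), by = "day")
for (i in seq_along(days)) {
  chunk <- fake_day(days[i])
  # Append the chunk; only one day of data is held in memory at a time
  dbWriteTable(con, "pitch", chunk, append = TRUE)
}

n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM pitch")$n
dbDisconnect(con)
```

Because each chunk is written and then discarded, memory use depends on the chunk size rather than on the full date range.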

Animating PITCHf/x

Query/Animate PITCHf/x

Player and date information is recorded at the at-bat level.

library(dplyr)
atbats <- tbl(db, 'atbat') %>%
  filter(pitcher_name == 'Yu Darvish', batter_name == 'Albert Pujols', 
         date == '2013_04_24')
  • Now, obtain PITCHf/x data for these at-bats.
    tbl(db, 'pitch') %>%
      inner_join(atbats, by = c('num', 'gameday_link')) %>%
      collect() %>% pitchRx::animateFX()
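
Why join on both keys? A toy sketch with made-up data frames (the names `atbat_toy`, `pitch_toy`, and the `gid_*` values are hypothetical): `num` numbers at-bats within a game, so it repeats across games and only identifies an at-bat together with `gameday_link`.

```r
library(dplyr)

atbat_toy <- data.frame(
  gameday_link = c("gid_A", "gid_A", "gid_B"),
  num          = c(1, 2, 1),
  batter_name  = c("Batter X", "Batter Y", "Batter X")
)
pitch_toy <- data.frame(
  gameday_link = c("gid_A", "gid_A", "gid_A", "gid_B", "gid_B"),
  num          = c(1, 1, 2, 1, 1),
  des          = c("Ball", "Called Strike", "Ball", "Ball", "Called Strike")
)

# One at-bat of interest: Batter X in game A
wanted <- filter(atbat_toy, batter_name == "Batter X", gameday_link == "gid_A")

# Joining on both keys returns only the pitches from that at-bat
joined <- inner_join(pitch_toy, wanted, by = c("num", "gameday_link"))

# Joining on num alone also pulls in pitches from game B
by_num_only <- inner_join(pitch_toy, wanted, by = "num")
```

Here `joined` has 2 rows, while `by_num_only` has 4 because at-bat number 1 exists in both games.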

Modeling called strike decisions

Inspired by Brian Mills' work

    # condition on umpire decisions
    pitches <- tbl(db, "pitch") %>%
      filter(des %in% c("Called Strike", "Ball")) %>%
      mutate(strike = as.numeric(des == "Called Strike"))
    # goal is to compare 2008 to 2014
    atbats <- tbl(db, "atbat") %>%
      # dates look like '2013_04_24', so the first four characters are the year
      mutate(year = substr(date, 1L, 4L)) %>%
      filter(year %in% c("2008", "2014"))
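    # join on the at-bat keys, keep only 2008/2014 pitches, and pull the result into R for bam()
    dat <- inner_join(pitches, atbats, by = c("num", "gameday_link")) %>%
      collect()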
    library(mgcv)
    # 48 (2 x 2 x 12) surfaces!
    m <- bam(strike ~ interaction(stand, year, count) +
                s(px, pz, by = interaction(stand, year, count)),
              data = dat, family = binomial(link = 'logit'))
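
Written out, the model specified by the bam() call above looks roughly as follows. The index g and the symbols β and f are notation introduced here for illustration; g(i) denotes the (stand, year, count) combination of pitch i.

```latex
\operatorname{logit}\,\Pr(\text{strike}_i = 1)
  = \beta_{g(i)} + f_{g(i)}(px_i,\, pz_i),
\qquad g(i) \in \{1, \dots, 48\}
```

Each of the 2 x 2 x 12 = 48 combinations gets its own level shift β_g from the parametric interaction term and its own smooth surface f_g over horizontal (px) and vertical (pz) location from the `by` smooth.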

Visualizing differences

library(ggplot2)  # provides facet_grid()
strikeFX(dat, model = m, density1 = list(year = "2008"),
          density2 = list(year = "2014"), 
          layer = facet_grid(count ~ stand))

Middle of the plate at the knees

Some takeaways

  • Called strikes at the knees were 2-4 times more likely in 2014 than in 2008.
  • Called strikes up-and-in or up-and-away, however, are much less likely nowadays.

Confidence/credible Intervals for GAMs

  • Form an approximate CI on the scale of the linear predictor, then transform to the response scale.
    • Pros: computationally cheap
    • Cons: point-wise; approximate; assumes smoothing parameters are known
  • Simulate from the posterior and obtain percentiles
    • Pros: simultaneous (as opposed to point-wise)
    • Cons: assumes smoothing parameters are known
  • Parametric bootstrap
    • Pros: simultaneous (as opposed to point-wise); not conditional on smoothing parameters
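
A minimal sketch of the posterior-simulation approach with mgcv, on simulated data rather than PITCHf/x. The toy data, grid, and number of draws are arbitrary assumptions; like the bullet above, it conditions on the estimated smoothing parameters.

```r
library(mgcv)  # ships with R as a recommended package

set.seed(1)
# Simulated stand-in for called-strike data (not PITCHf/x)
n   <- 2000
px  <- runif(n, -2, 2)
pz  <- runif(n, 0.5, 4.5)
p   <- plogis(3 - 3 * px^2 - 2 * (pz - 2.5)^2)
toy <- data.frame(px = px, pz = pz, strike = rbinom(n, 1, p))

fit <- gam(strike ~ s(px, pz), data = toy, family = binomial, method = "REML")

# Locations at which to evaluate the fitted surface
grid <- expand.grid(px = seq(-1.5, 1.5, length.out = 15),
                    pz = seq(1, 4, length.out = 15))
Xp <- predict(fit, newdata = grid, type = "lpmatrix")

# Draw coefficient vectors from the approximate posterior N(beta_hat, Vp)
B        <- rmvn(1000, coef(fit), vcov(fit))  # draws x coefficients
eta_sims <- Xp %*% t(B)                       # grid points x draws, link scale

# Point-wise 95% interval: percentiles at each grid point, response scale
pointwise <- t(apply(plogis(eta_sims), 1, quantile, probs = c(0.025, 0.975)))

# Simultaneous 95% interval: critical value taken from the distribution of
# the largest standardized deviation over the whole grid
eta_hat <- drop(Xp %*% coef(fit))
se      <- sqrt(rowSums((Xp %*% vcov(fit)) * Xp))
max_dev <- apply(abs(eta_sims - eta_hat) / se, 2, max)
crit    <- quantile(max_dev, 0.95)
simultaneous <- cbind(lower = plogis(eta_hat - crit * se),
                      upper = plogis(eta_hat + crit * se))
```

The simultaneous band is wider than the point-wise one whenever `crit` exceeds the usual normal quantile of about 1.96, which is the price of covering the whole surface at once.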

Take Trey's Advice

Future Work

  • How do we test for a significant overall difference between scenario A and B?
  • Possible solution: Visual Inference

Thank you!

Special thanks to:

  • Brian Mills for comments/discussions on pitchRx and GAMs.
  • Mike Lopez for the invitation.